Extract Topics from Documents (LDA) (Operator Toolbox)

Synopsis

This operator extracts topics from a collection of documents using Latent Dirichlet Allocation (LDA).

Description

LDA (Latent Dirichlet Allocation) is a method for identifying topics in a collection of documents. This implementation uses the ParallelTopicModel of the Mallet library (source: Newman, Asuncion, Smyth and Welling, Distributed Algorithms for Topic Models, JMLR (2009)) with the SparseLDA sampling scheme and data structure (source: Yao, Mimno and McCallum, Efficient Methods for Topic Model Inference on Streaming Document Collections, KDD (2009)).

LDA provides topic diagnostics in the model object. For details on the measures see: http://mallet.cs.umass.edu/diagnostics.php . Note that some measures depend on the number of top words.

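When the model is applied to new documents, their topics are inferred by Gibbs sampling. For this reason, Apply Model (Documents) exposes additional parameters (LDA.iterations, LDA.burnin and LDA.thinning), which are described under Parameters.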
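
To make the roles of the application parameters more concrete, the following is a minimal Python sketch of Gibbs sampling for a single new document. It is not the Mallet code used by the operator. It assumes that the topic-word probabilities of the trained model are held fixed, that alpha is the same for every topic, and that a sample is kept after the burn-in on every thinning-th round. The function name infer_topic_confidences and all argument names are hypothetical.

```python
import random


def infer_topic_confidences(doc, topic_word_prob, alpha,
                            iterations=100, burnin=10, thinning=5, seed=0):
    """Estimate per-topic confidences for one document (illustrative only).

    doc             -- list of word ids
    topic_word_prob -- topic_word_prob[k][w] = P(word w | topic k), held fixed
    alpha           -- symmetric prior on the document's topic distribution
    """
    rng = random.Random(seed)
    num_topics = len(topic_word_prob)

    # Start from a random topic assignment for every token.
    assignments = [rng.randrange(num_topics) for _ in doc]
    doc_topic_counts = [0] * num_topics
    for topic in assignments:
        doc_topic_counts[topic] += 1

    summed = [0.0] * num_topics
    kept = 0
    for round_no in range(1, iterations + 1):
        for pos, word in enumerate(doc):
            # Take the token out of the counts before resampling its topic.
            doc_topic_counts[assignments[pos]] -= 1
            weights = [(doc_topic_counts[k] + alpha) * topic_word_prob[k][word]
                       for k in range(num_topics)]
            new_topic = rng.choices(range(num_topics), weights=weights)[0]
            assignments[pos] = new_topic
            doc_topic_counts[new_topic] += 1

        # Skip the burn-in rounds, then keep every thinning-th round
        # (assumed retention rule).
        if round_no > burnin and (round_no - burnin) % thinning == 0:
            total = len(doc) + num_topics * alpha
            for k in range(num_topics):
                summed[k] += (doc_topic_counts[k] + alpha) / total
            kept += 1

    if kept == 0:
        raise ValueError("no samples were kept with these settings")
    # Average the kept samples to obtain one confidence per topic.
    return [value / kept for value in summed]


if __name__ == "__main__":
    # Two toy topics over a four-word vocabulary (made-up numbers).
    topics = [[0.45, 0.45, 0.05, 0.05],
              [0.05, 0.05, 0.45, 0.45]]
    print(infer_topic_confidences([0, 1, 0, 1, 0, 0, 1, 1], topics, alpha=0.1))
```

In this sketch, more iterations and a smaller thinning value both increase the number of samples that are averaged, while the burn-in discards the rounds that still depend on the random starting assignment.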

Input

col (Collection)
A preprocessed collection of documents.

Output

exa (Data table)
An ExampleSet with added "documentId" and "TopicId" attributes, and an additional attribute showing the confidence that this document belongs to the topic.
top (Data table)
An ExampleSet with details on the topic. For each topic the operator returns the top 5 most used words.
mod
The topic model. It can be applied to a new collection of documents using Apply Model (Documents).
per (Averagable)
The LogLikelihood value of the fit which can be used for optimization.
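
The topic table described above can be pictured as a ranking of words by how often they are assigned to each topic. The Python sketch below illustrates that idea under the assumption that "most used" means the highest per-topic word count. The names top_words_per_topic, topic_word_counts and vocabulary are hypothetical and do not correspond to the operator's internals, and the counts are made up.

```python
def top_words_per_topic(topic_word_counts, vocabulary, top_n=5):
    """Return (topic id, top words) pairs ranked by per-topic word count."""
    rows = []
    for topic_id, counts in enumerate(topic_word_counts):
        # Rank word ids by descending count; ties keep vocabulary order.
        ranked = sorted(range(len(counts)), key=lambda w: counts[w], reverse=True)
        rows.append((topic_id, [vocabulary[w] for w in ranked[:top_n]]))
    return rows


if __name__ == "__main__":
    vocabulary = ["lorem", "ipsum", "dolor", "sit", "amet", "elit"]
    counts = [[9, 7, 1, 0, 3, 2],   # made-up counts for topic 0
              [0, 1, 8, 6, 2, 5]]   # made-up counts for topic 1
    for topic_id, words in top_words_per_topic(counts, vocabulary, top_n=3):
        print(topic_id, words)
```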

Parameters

number of topics
Number of topics to search for.
use alpha heuristics
If this parameter is set to true, alpha is set automatically. The heuristic used is: 50 / number of topics.
alpha sum
Bayesian prior on the topic distribution.
use beta heuristics
If this parameter is set to true, beta is set automatically. The heuristic used is: 50 / number of words.
beta
Bayesian prior on the word distribution.
optimize hyperparameters
If this parameter is set to true, both alpha and beta are optimized every k-th step, where k is given by the "optimize interval for hyperparameters" parameter.
optimize interval for hyperparameters
Frequency of hyperparameter optimization, that is, the number of steps between two optimizations.
top words per topic
Number of words used to describe each topic.
iterations
Number of iterations for optimization.
reproducible
If this parameter is set to true, parallel execution is deactivated. Results may differ between runs if this is left unchecked.
enable logging
If this parameter is set to true, additional output is provided in the Log panel.
use local random seed
This parameter indicates if a local random seed should be used.
local random seed
If the use local random seed parameter is checked, this parameter determines the local random seed.
include meta data
If checked, available meta information of the text, such as filename and date, is added as attributes.
LDA.iterations (Application)
Number of iterations for Gibbs sampling. Available in Apply Model (Documents).
LDA.burnin (Application)
Ignore the first x rounds of sampling. Should be smaller than LDA.iterations so that some rounds remain to be sampled. Available in Apply Model (Documents).
LDA.thinning (Application)
Only use every x-th iteration to determine the confidence. Available in Apply Model (Documents).
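
The alpha and beta heuristics listed above reduce to simple arithmetic. The Python sketch below works them out exactly as the parameter descriptions state them. It assumes that "number of words" refers to the size of the vocabulary, which the descriptions do not spell out, and the function name heuristic_priors is hypothetical.

```python
def heuristic_priors(number_of_topics, number_of_words):
    """Return (alpha, beta) following the documented heuristics."""
    if number_of_topics <= 0 or number_of_words <= 0:
        raise ValueError("both arguments must be positive")
    alpha = 50.0 / number_of_topics  # use alpha heuristics
    beta = 50.0 / number_of_words    # use beta heuristics (assumed: vocabulary size)
    return alpha, beta


if __name__ == "__main__":
    # Example: 10 topics and a vocabulary of 5000 distinct words.
    print(heuristic_priors(10, 5000))
```

Under these assumptions, increasing the number of topics lowers alpha, and a larger vocabulary lowers beta.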

Tutorial Processes

A simple application on lorem ipsum

This sample process is a minimal example for LDA. It generates a collection of documents based on Lorem Ipsum, processes them using the Text Processing extension, and feeds them into the LDA operator.
